Impact of non-Poisson activity patterns on spreading processes
Halting a computer or biological virus outbreak requires a detailed understanding of the timing
of the interactions between susceptible and infected individuals. While current spreading mod- 
els assume that users interact uniformly in time, following a Poisson process, a series of recent 
measurements indicate that the inter-contact time distribution is heavy tailed, corresponding to a 
temporally inhomogeneous bursty contact process. Here we show that the non-Poisson nature of 
the contact dynamics results in prevalence decay times significantly larger than predicted by the 
standard Poisson process based models. Our predictions are in agreement with the detailed time 
resolved prevalence data of computer viruses, which, according to virus bulletins, show a decay time 
close to a year, in contrast with the one day decay predicted by the standard Poisson process based 
models.

According to The WildList Organization International (www.wildlist.org) there were 130 known computer
viruses in 1993, a number that had exploded to 4,767
by April 2006. With the proliferation of broadband "always on"
connections, file downloads, instant messaging,
Bluetooth-enabled mobile devices and other communication
technologies, the mechanisms used by worms and
viruses to spread have evolved as well. Still, most viruses
continue to spread through email attachments. Indeed,
according to the Virus Bulletin (www.virusbtn.com), the
email worms W32/Netsky.h and W32/Mytob, which spread
themselves through email, accounted for 70%
of the virus prevalence in April 2006. When such a worm
infects a machine, it sends an infected email to all
addresses in the computer's email address book. This
self-broadcast mechanism allows for the worm's rapid
reproduction and spread, explaining why email worms
continue to be the main security threat.

In order to eradicate viruses, as well as to control and
limit the impact of an outbreak, we need a detailed
and quantitative understanding of the spreading
dynamics. This is currently provided by a wide range of
epidemic models, each adapted to the particular realities
of the computer based spreading process. A common
feature of all current epidemic models
is the assumption that the contact process between
individuals follows Poisson statistics, meaning that the
probability that an agent interacts with another agent in a
time interval $dt$ is $dt/\langle\tau\rangle$, where $\langle\tau\rangle$ is the mean inter-event
time. Furthermore, the time $\tau$ between two consecutive
contacts is predicted to follow an exponential
distribution with mean $\langle\tau\rangle$. Therefore, reports of new
infections should decay exponentially with a decay time of
about a day, or at most a few days, given
that most users check their email on a daily basis,
giving a $\langle\tau\rangle$ of at most a few days (see below). In
contrast, prevalence records indicate that new infections
are still reported years after the release of antiviruses
(http://www.virusbtn.com), and their decay time
is in the vicinity of years, two to three orders of magnitude
larger than the decay times predicted by the Poisson process.
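
To make the Poisson assumption concrete, the following minimal sketch draws inter-event times from an exponential distribution and checks how rarely long gaps occur. The mean of one day, the sample size and the function names are illustrative assumptions, not values or code from the measurements discussed here.

```python
import math
import random

def poisson_intervals(mean_tau, n, rng):
    """Draw n inter-event times of a Poisson process with mean mean_tau."""
    return [rng.expovariate(1.0 / mean_tau) for _ in range(n)]

def survival_fraction(intervals, threshold):
    """Fraction of inter-event times longer than threshold."""
    return sum(1 for x in intervals if x > threshold) / len(intervals)

rng = random.Random(0)   # fixed seed, only for reproducibility
mean_tau = 1.0           # assumed mean inter-event time, in days
taus = poisson_intervals(mean_tau, 200_000, rng)

sample_mean = sum(taus) / len(taus)
frac_over_5 = survival_fraction(taus, 5.0 * mean_tau)
expected_over_5 = math.exp(-5.0)  # exponential survival probability
print(f"mean={sample_mean:.3f}  P(tau > 5<tau>)={frac_over_5:.4f}  exp(-5)={expected_over_5:.4f}")
```

Under this assumption gaps much longer than the mean are exponentially rare, which is why a Poisson contact process cannot by itself produce decay times far above $\langle\tau\rangle$.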

A possible resolution of this discrepancy may be rooted
in the failure of the Poisson approximation for the
inter-event time distribution, currently used in all modeling
frameworks. Indeed, recent studies of email exchange
records between individuals in a university environment
have shown that the probability density function $P(\tau)$ of
the time interval $\tau$ between two consecutive emails sent
by the same user is well approximated by a fat tailed
distribution $P(\tau) \sim \tau^{-1}$. In the following
we provide evidence that this deviation from the
Poisson process has a strong impact on the email worm's
spread, offering a coherent explanation of the
anomalously long prevalence times observed for email viruses.

Email activity patterns: The contact dynamics responsible
for the spread of email worms is driven by the
email communication and usage patterns of individuals.
To characterize these patterns we studied two email
datasets. The first dataset contains emails from a university
environment, capturing the communication pattern
between 3,188 users and consisting of 129,135 emails [10].
The second dataset contains emails from a commercial
provider (freemail.hu), spanning ten months, 1,729,165
users and 39,046,030 emails. For both email datasets
$P(\tau)$ is rather broad, following approximately a power
law with exponent $\alpha \approx 1$ and a cutoff at large $\tau$ values
(Fig. 1). Most importantly, the value of the cutoff
depends on the time window $T$ over which the data has
been recorded (Fig. 1a,b). By restricting the data to
varying time windows we find that $P(\tau)$ goes to zero as
$1 - \tau/T$ when $\tau$ approaches $T$. After correcting for the
finiteness of the observation time window we obtain that
the distributions for different $T$ values collapse into a
single curve (Fig. 1c,d), representing the true inter-event
time distribution. The obtained $P(\tau)$ is well
approximated by a power law decay followed by an exponential
cutoff (Fig. 1e,f), i.e.



$$P_E(\tau) = A\,\tau^{-\alpha} \exp\left(-\frac{\tau}{\tau_E}\right), \qquad (1)$$



where $A$ is a normalization factor. The power law decay
at small and intermediate $\tau$ is clearly manifested on the
log-log plot of $P(\tau)$ (Fig. 1c,d), consistent with $\alpha \approx 1$
and spanning four to six decades.
The exponential cutoff is best seen in a semi-log plot
after removing the power law decay (Fig. 1e,f), resulting
in a decay time $\tau_E = 25 \pm 2$ days and $\tau_E = 9 \pm 1$
months (approximately 270 days) for the university and
commercial datasets, respectively. In
contrast, the Poisson approximation predicts
$P_P(\tau) = \exp(-\tau/\langle\tau\rangle)/\langle\tau\rangle$ [16], where $\langle\tau\rangle$ is the mean inter-event
time, taking the values 0.86 and 4.9 days for the university
and commercial data, respectively.
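
The gap between the mean inter-event time and the cutoff time can be illustrated numerically. The sketch below integrates the form in Eq. (1) with $\alpha = 1$ and $\tau_E = 25$ days; the lower cutoff of $10^{-2}$ days, the upper integration limit and the function name are assumptions made only for this illustration, so the resulting mean is not expected to reproduce the measured 0.86 days.

```python
import math

def truncated_power_law_moments(alpha, tau_e, tau_min, tau_max, n=20000):
    """Normalisation A and mean of P_E(tau) = A * tau**(-alpha) * exp(-tau/tau_e)
    on [tau_min, tau_max], using the midpoint rule on a logarithmic grid."""
    log_lo, log_hi = math.log(tau_min), math.log(tau_max)
    h = (log_hi - log_lo) / n
    norm = 0.0   # integral of tau**(-alpha) * exp(-tau/tau_e)
    first = 0.0  # integral of tau * tau**(-alpha) * exp(-tau/tau_e)
    for i in range(n):
        tau = math.exp(log_lo + (i + 0.5) * h)
        # d(tau) = tau * d(log tau) on the logarithmic grid
        w = tau ** (-alpha) * math.exp(-tau / tau_e) * tau * h
        norm += w
        first += tau * w
    return 1.0 / norm, first / norm

# tau_min and tau_max are assumed integration limits, in days.
A, mean_tau = truncated_power_law_moments(alpha=1.0, tau_e=25.0,
                                          tau_min=0.01, tau_max=1000.0)
print(f"A = {A:.4f}, mean inter-event time = {mean_tau:.2f} days, tau_E = 25 days")
```

With these assumed limits the mean comes out at a few days, several times smaller than $\tau_E$, which is the separation of time scales that the Poisson approximation cannot represent.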

The dynamics of worm spreading: To investigate the 
impact of the observed non-Poisson activity patterns on 
spreading processes we study the spread of email worms 
among email users. For the moment we ignore the
possibility that some users may delete the infected email or
may have installed the worm antivirus and therefore do
not participate in the spreading process; we return later
to the possible impact of these events on our predictions.
Under these assumptions, the spreading process is well described by the
susceptible-infected (SI) model on the email network.

The spreading dynamics is jointly determined by the
email activity patterns and the topology of the
corresponding email communication network [13]. The
email activity patterns are reflected in the infection
generation times, where the generation time is defined as
the time interval between the infection of the primary
case (the user sending the email) and the infection of
a secondary case (a different user opening the received
infected email). From the perspective of the secondary
case, the time when a user receives the infected email
is random and the generation time is the time interval
between the arrival and the opening of the infected email.
In most cases received emails are answered in the next
email activity burst [13], and viruses act when
emails are read, at approximately the same time as the
next batch of emails is written. Therefore the generation
time can be approximated by the time interval
between the arrival of a virus infected email and the next
email sent to any recipient by the secondary case. If we
model the email activity pattern as a renewal process [16]
with inter-event time distribution $P(\tau)$, then the
generation time is the residual waiting time and is characterized
by the probability density function

$$g(\tau) = \frac{1}{\langle\tau\rangle} \int_{\tau}^{\infty} dx\, P(x). \qquad (2)$$

FIG. 1: Distribution $P(\tau)$ of the inter-event time between two
consecutive emails sent by an email user. The left and right
panels represent the university and commercial datasets,
respectively. For each dataset we aggregate the inter-event times
of all users (the distribution for single users exhibits a similar
behavior [13]) and apply logarithmic binning to account for
the fact that the number of observed events decreases with
increasing $\tau$. (a) and (b) Log-log plot of $P(\tau)$ for $\tau > 10^{-2}$ days,
emphasizing the large $\tau$ behavior for different time window
sizes $T$. (c) and (d) The same plots after removing the finite
time window effects, the data collapsing into a single curve.
The solid line represents the power law decay $P(\tau) \sim \tau^{-1}$.
(e) and (f) Semilog plot emphasizing the exponential decay
at large $\tau$, for the largest time window $T$. The solid lines
are fits to an exponential decay $\tau P(\tau) \sim e^{-\tau/\tau_E}$, resulting in
$\tau_E = 25 \pm 2$ days and $9 \pm 1$ months for the university (e) and
commercial (f) datasets, respectively. The outlier in (f) was
excluded when fitting to an exponential decay.

FIG. 2: Average number of new infections as a function of $t$ (days) resulting from
simulations using the email history of the university dataset (solid
line), using a one day interval binning. The inset shows a
zoom of the initial stages of the spreading process using a
one hour interval binning. The lines correspond to the
exponential decay predicted by the Poisson process approximation
(dashed) and the true inter-event distribution (dot-dashed).

Next we calculate the average number of new infections
$n(t)$ at time $t$ resulting from an outbreak starting from a
single infected user at $t = 0$. Although the email network
contains cycles, it is very sparse, thus we approximate
it by a tree-like structure. Previous analytical studies
have shown that this approximation captures the main
features of the spreading dynamics on real networks [3, 18].
In this case $n(t)$ is given by [18]

$$n(t) = \sum_{d=1}^{D} z_d\, g^{*d}(t), \qquad (3)$$

where $z_d$ is the average number of users $d$ email contacts
away from the first infected user, $D$ is the maximum of $d$,
and $g^{*d}(t)$ is the $d$-order convolution of $g(\tau)$, with $g^{*1}(t) = g(t)$
and $g^{*d}(t) = \int_0^t d\tau\, g(\tau)\, g^{*(d-1)}(t - \tau)$ for $d > 1$, representing
the probability density function of the sum of $d$ generation
times. Substituting the Poisson approximation and
email data inter-event time distributions into (2) and the
result into (3) we obtain



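$$n(t) = F(t) \exp\left(-\frac{t}{\tau_0}\right), \qquad (4)$$

where $\tau_0 = \langle\tau\rangle$ for the Poisson approximation and $\tau_0 = \tau_E$
for the email data, and

$$F(t) = \begin{cases} \displaystyle \sum_{d=1}^{D} \frac{z_d}{\langle\tau\rangle\,(d-1)!} \left(\frac{t}{\langle\tau\rangle}\right)^{d-1}, & \text{Poisson approx.} \\[2ex] \displaystyle \sum_{d=1}^{D} z_d\, f^{*d}(t), & \text{Email data,} \end{cases} \qquad (5)$$

where $f(\tau) = A \int_{\tau}^{\infty} dx\, x^{-\alpha} e^{(\tau - x)/\tau_E} / \langle\tau\rangle$. In the long time
limit (4) is dominated by the exponential decay while
$F(t)$ gives just a correction. The decay time is, however,
significantly different for the Poisson approximation and
the real inter-event time distribution.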
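
As a rough numerical check of the long time behavior, the sketch below evaluates the local decay time $-g(\tau)/g'(\tau)$ of the residual waiting time density in Eq. (2) for the truncated power law of Eq. (1). It assumes $\alpha = 1$ and $\tau_E = 25$ days; the function names, the integration range and the evaluation points are illustrative choices, not part of the original analysis.

```python
import math

def tail_integral(alpha, tau_e, tau, n=4000, span=40.0):
    """Integral of x**(-alpha) * exp(-x/tau_e) from tau to tau + span*tau_e,
    by the midpoint rule; the neglected remainder is of relative order exp(-span)."""
    h = span * tau_e / n
    total = 0.0
    for i in range(n):
        x = tau + (i + 0.5) * h
        total += x ** (-alpha) * math.exp(-x / tau_e) * h
    return total

def local_decay_time(alpha, tau_e, tau):
    """-g(tau)/g'(tau) for g(tau) = (1/<tau>) * integral_tau^inf P_E(x) dx.
    Because g'(tau) = -P_E(tau)/<tau>, the factors A and <tau> cancel."""
    density = tau ** (-alpha) * math.exp(-tau / tau_e)
    return tail_integral(alpha, tau_e, tau) / density

tau_e = 25.0  # days, the cutoff quoted for the university dataset
decay_times = {tau: local_decay_time(1.0, tau_e, tau) for tau in (50.0, 200.0, 1000.0)}
for tau, value in decay_times.items():
    print(f"tau = {tau:6.0f} days -> local decay time = {value:.1f} days")
```

Under these assumptions the local decay time stays below $\tau_E$ and approaches it as $\tau$ grows, in line with the statement that the exponential factor dominates at long times while the prefactor contributes only a correction.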

To test these predictions we perform numerical simu- 
lations using the detailed email communication history. 
In this case a susceptible user receiving an infected email 
at time t becomes infected and sends an infected email 
to all its email contacts at t' > t, where t' is the time 
he/she sends an email for the first time after infection, as 
documented in the email data. To reduce the computa- 
tional cost we focus our analysis on the smaller university 
dataset. The average number of newly infected users
resulting from the simulation exhibits daily (Fig. 2, inset)
and weekly oscillations (Fig. 2, main panel), reflecting
the daily and weekly periodicity of human activity
patterns. More importantly, after ten days the oscillations
are superimposed on an exponential decay, with a decay
time of about 21 days (see Fig. 2). The Poisson process
approximation would predict a decay time of one day, in
evident disagreement with the simulations (Fig. 2). In
contrast, using the correct inter-event time distribution
for the university dataset we predict a decay time of $25 \pm 2$
days, in good agreement with the numerical simulations
(Fig. 2).

FIG. 3: Number of new infections reported for six worm
outbreaks, according to Virus Bulletin (www.virusbtn.com). The
lines are fits to an exponential decay, resulting in the decay
times (measured in months): LoveLetter ($13 \pm 2$), Ethan
($12 \pm 1$), Marker ($14 \pm 2$), Class ($12 \pm 1$), Melissa ($13 \pm 1$),
W32/Ska ($11 \pm 1$).
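
The simulation rule described above can be sketched as a replay of a time-ordered email log. The code below is a minimal illustration, not the code used for the reported simulations: the event format, the function name, the toy log, and the choice to treat every address a user writes to anywhere in the log as that user's address book are all assumptions.

```python
from collections import defaultdict

def simulate_si(events, seed_user, t_start):
    """Replay a time-ordered email log and return {user: infection_time}.

    events: list of (time, sender, recipient) tuples, sorted by time.
    An infected user transmits at the first email sent strictly after
    infection, to every address they write to anywhere in the log
    (an assumed stand-in for the address book).
    """
    contacts = defaultdict(set)
    for _, sender, recipient in events:
        contacts[sender].add(recipient)

    infected = {seed_user: t_start}
    has_broadcast = set()
    for time, sender, _ in events:
        if sender in has_broadcast or sender not in infected:
            continue
        if time <= infected[sender]:
            continue  # the worm acts only on emails sent after infection
        has_broadcast.add(sender)
        for user in contacts[sender]:
            infected.setdefault(user, time)  # SI model: no recovery
    return infected

# Hypothetical toy log: (time in days, sender, recipient).
toy_log = [(1.0, "a", "b"), (2.0, "b", "c"), (3.0, "a", "c"),
           (5.0, "c", "d"), (6.0, "b", "a")]
result = simulate_si(toy_log, seed_user="a", t_start=0.0)
print(result)
```

Binning the returned infection times by day or by hour would give curves of the kind shown in Fig. 2; averaging over many seed users and start times is left out of this sketch.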

The analysis of the university dataset allows us to
demonstrate the connection between the large $\tau$
behavior of the inter-event time distribution $P(\tau)$ and the long
time decay of the prevalence $n(t)$. Our main finding is
that the prevalence decay time is given by the
characteristic decay time of the inter-event time distribution. More
importantly, we show that the Poisson process
approximation clearly underestimates the decay time. For Poisson
processes the two time scales, the average inter-event time
and the characteristic time of the exponential decay,
coincide, both being of the order of one to at most a few days.
Using measurements on the commercial dataset,
containing a larger number of individuals and covering a wider
spectrum of email users, we can extrapolate these
conclusions to predict the behavior of real viruses. Given
the value of $\tau_E$ for the commercial dataset we predict
that the email worm prevalence should decay
exponentially with time, with a decay time of about nine months.
The prevalence tables reported by the Virus Bulletin web
site (http://www.virusbtn.com) indicate that worm
outbreaks persist for several months, following an
exponential decay with a decay time of around twelve months
(Fig. 3). Our nine month prediction is thus much closer to the
observed value than the $\langle\tau\rangle \approx 1$-$4$ day prediction based
on the Poisson approximation. The fact that our
prediction underestimates the actual decay time by about three
months is probably rooted in the fact that the commercial
dataset, despite its coverage of an impressive 1.7 million
users, still captures only a small segment (approximately
0.1%) of all Internet users.
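
Decay times such as those quoted for Fig. 3 can be estimated by fitting a straight line to the logarithm of the reported counts. The text does not specify the fitting procedure, so the ordinary least-squares sketch below is an assumption, and the monthly counts are synthetic values generated with a known 12-month decay time, not Virus Bulletin data.

```python
import math

def fit_decay_time(times, counts):
    """Least-squares fit of ln(counts) = a - times / tau; returns tau.
    Points with non-positive counts are skipped."""
    pts = [(t, math.log(c)) for t, c in zip(times, counts) if c > 0]
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in pts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in pts)
    slope = sxy / sxx
    return -1.0 / slope

# Synthetic monthly counts with a known 12-month decay time (not real data).
months = list(range(36))
synthetic = [1000.0 * math.exp(-m / 12.0) for m in months]
tau_fit = fit_decay_time(months, synthetic)
print(f"fitted decay time: {tau_fit:.2f} months")
```

On real prevalence tables the counts are noisy and the early growth phase would have to be excluded before fitting, so a fit of this kind gives only an indicative decay time.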

As we discussed above, some other factors potentially
affecting the spreading of email worms were not
considered in our analysis. First, some users may delete the
infected emails or may have installed the worm antivirus.
Since these users do not participate in the spreading
process they are eliminated from the average number of users
$z_d$ that are found $d$ email contacts away from the first
infected user. While this would affect the initial spread
characterized by $F(t)$ in (5), the exponential decay in (4)
and the decay time $\tau_0 = \tau_E$ will not be altered. Second,
some email viruses do not use the self-broadcasting
mechanism of email worms. For example, file viruses require
the email user to attach the infected file to an outgoing email
in order to be transmitted. In turn, only some email
contacts will receive the infected file. Once again, this
affects $z_d$ but not the email activity patterns. Therefore,
the prevalence of email viruses in general should decay
exponentially in time with a decay time $\tau_0$ determined
by the decay time $\tau_E$ of the inter-event time distribution.
Third, new virus strains regularly emerge following small
modifications of earlier viruses. Within this work new
virus strains are modeled as new outbreaks. An
alternative approach is to analyze all strains together, modeling
the emergence of new strains as a process of reinfection.
In this second approach the dynamics is better described
by the susceptible-infected-susceptible (SIS) model [1].
Earlier work has shown that if reinfections are allowed
in networks with a power law degree distribution, long
prevalence decay times may emerge, which increase with
increasing network size. The data shown in Fig. 3
represent, however, the spread of a single virus strain,
which is better captured by the SI model. For the SI
model with a Poisson activity pattern we should
get a rapid decay in prevalence, indicating that the
empirically observed long decay times cannot be attributed
to this reinfection-based mechanism.

A series of recent measurements indicate that power
law inter-event time distributions are not a unique
feature of email communications, but emerge in a wide range
of human activity patterns, describing the timing of
financial transactions [19, 20], response times of internauts
[21], online games [22], login times into email servers [23]
and printing processes [24]. Together they raise the
possibility that non-Poisson contact timing is a common
feature of human dynamics and thus could impact other
spreading processes as well. Indeed, measurements
indicate that the patterns of visitation of public places, like
libraries, or the long range travel patterns of
humans, involving car and air travel, are also driven by fat
tailed inter-event times [25]. Such travel patterns play
a key role in the spread of biological viruses, such as
influenza or SARS [26]. Taken together, these results
indicate that the anomalous decay time predicted and
observed for email viruses may in fact apply more widely,
potentially impacting the spread of biological viruses as
well.